To see a practical application of 2SLS in R, we will follow a paper by Card (1995) on the effect of education on wages.
Card uses college proximity in the US as an IV for education
Data are from the National Longitudinal Survey of Young Men (NLSYM), which began in 1966
Data for about 5,500 men
The 2SLS in R
# Load necessary libraries
library(AER)
library(haven)
library(tidyverse)

# Define a function to read data from a specified URL
read_data <- function(df) {
  # Construct the full URL
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/", df, sep = "")
  # Read the .dta file from the URL
  df <- read_dta(full_path)
  return(df)
}

# Use the function to read the 'card.dta' dataset
card <- read_data("card.dta")

# Attach the dataframe to make its columns directly accessible as variables
attach(card)

# Define the variables for the regression analyses
Y1 <- lwage                                        # Dependent variable
Y2 <- educ                                         # Endogenous variable
X1 <- cbind(exper, black, south, married, smsa)    # Exogenous variables
X2 <- nearc4                                       # Instrument

# Perform an OLS regression
# Y1 is the dependent variable, Y2 and X1 are the independent variables
ols_reg <- lm(Y1 ~ Y2 + X1)
summary(ols_reg)  # Display the results of the OLS regression

# Perform a 2SLS regression
# Y1 is the dependent variable, Y2 and X1 are the independent variables,
# X1 and X2 are the instruments
iv_reg <- ivreg(Y1 ~ Y2 + X1 | X1 + X2)
summary(iv_reg)   # Display the results of the 2SLS regression
Call:
lm(formula = Y1 ~ Y2 + X1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.59924 -0.23035  0.01812  0.23046  1.36797

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.063317   0.063740  79.437   <2e-16 ***
Y2           0.071173   0.003482  20.438   <2e-16 ***
X1exper      0.034152   0.002214  15.422   <2e-16 ***
X1black     -0.166027   0.017614  -9.426   <2e-16 ***
X1south     -0.131552   0.014969  -8.788   <2e-16 ***
X1married   -0.035871   0.003401 -10.547   <2e-16 ***
X1smsa       0.175787   0.015458  11.372   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3702 on 2996 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.305,  Adjusted R-squared:  0.3036
F-statistic: 219.2 on 6 and 2996 DF,  p-value: < 2.2e-16
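To make the "two stages" concrete, here is a minimal sketch that runs 2SLS by hand with two `lm()` calls on simulated data. It does not use the Card data: the variable names (`ability`, `near`, `educ`, `lwage`), the sample size, and every coefficient are hypothetical, chosen so that the true return to education is 0.10 and OLS is biased upward by an unobserved confounder.

```r
# Simulated data (all values are hypothetical; this is NOT the Card data)
set.seed(42)
n       <- 10000
ability <- rnorm(n)                        # unobserved confounder
near    <- rbinom(n, 1, 0.5)               # instrument: grew up near a college
educ    <- 12 + 1.0 * near + 1.0 * ability + rnorm(n)
lwage   <- 5 + 0.10 * educ + 0.30 * ability + rnorm(n, sd = 0.3)

# OLS: biased upward because ability raises both educ and lwage
ols <- lm(lwage ~ educ)

# Stage 1: regress the endogenous variable on the instrument
stage1   <- lm(educ ~ near)
educ_hat <- fitted(stage1)

# Stage 2: regress the outcome on the stage-1 fitted values
stage2 <- lm(lwage ~ educ_hat)

b_ols  <- unname(coef(ols)["educ"])
b_2sls <- unname(coef(stage2)["educ_hat"])
print(c(OLS = b_ols, TSLS = b_2sls))
```

The point estimate from the second stage matches what a 2SLS routine would report, but the standard errors printed by `summary(stage2)` are not valid because they ignore that `educ_hat` is itself estimated. In practice a dedicated function such as `ivreg()` is preferable for inference.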
\(Y_{it}\) is the wage of local workers in location \(i\) at time \(t\)
\(I_{it}\) is the share of migrants in the local labor market
\(X_{it}\) is a vector of local labor market characteristics
Problem: \(I_{it}\) is endogenous to local labor market conditions, because migrants tend to settle where wages and job prospects are good
Solution: construct an instrument that predicts the inflow of migrants to location \(i\) at time \(t\) from the national flow of migrants at \(t\) and the local share of migrants at an earlier date \(t_0\)
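The variable definitions above are consistent with a wage regression of roughly the following form. The linear specification and the symbols \(\alpha\), \(\beta\), \(\gamma\), and \(\varepsilon_{it}\) are assumptions introduced for illustration, since the equation itself is not given here.

\[
Y_{it}=\alpha+\beta I_{it}+X_{it}'\gamma+\varepsilon_{it},\qquad \operatorname{Cov}(I_{it},\varepsilon_{it})\neq 0
\]

Under this reading, \(\beta\) is the effect of the migrant share on local wages, and the nonzero covariance is the endogeneity problem: OLS attributes to \(I_{it}\) part of the effect of the unobserved local conditions in \(\varepsilon_{it}\).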
The Bartik IV: How to
\[
B_{it}=\sum_{k=1}^{K}z_{ikt_0}m_{kt}
\]
\(B_{it}\) is the predicted inflow of migrants to location \(i\) at time \(t\)
\(z_{ikt_0}\) is the share of migrants from origin \(k\) who lived in location \(i\) at time \(t_0\)
\(m_{kt}\) is the national flow of migrants from origin \(k\) at time \(t\)
The instrument is \(B_{it}\), the predicted inflow of migrants to location \(i\) at time \(t\)
The instrument is a weighted sum of the national flows of migrants from each origin \(k\) at time \(t\), where the weights are the shares of origin-\(k\) migrants who lived in location \(i\) at time \(t_0\)
The exclusion restriction is that the predicted inflow \(B_{it}\) affects local wages only through actual migrant inflows, so past settlement patterns and national flows must not be related to current local wage shocks
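The shift-share formula above amounts to a matrix product of past shares and national flows. The sketch below illustrates it with a toy example of three locations and two origin groups; the object names (`z_t0`, `m_t`, `B_t`) and all numbers are made up for illustration, and each origin's shares are assumed to sum to one across locations.

```r
# Past settlement shares at t0: rows are locations i, columns are origins k.
# Each column sums to 1 (hypothetical numbers).
z_t0 <- matrix(
  c(0.5, 0.1,
    0.3, 0.2,
    0.2, 0.7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("loc1", "loc2", "loc3"), c("originA", "originB"))
)

# National inflow of migrants from each origin at time t (hypothetical)
m_t <- c(originA = 1000, originB = 500)

# B_it = sum over k of (share of origin k in location i at t0) * (national flow of k at t),
# computed for all locations at once
B_t <- drop(z_t0 %*% m_t)
print(B_t)
```

With these numbers the predicted inflows are 550 for `loc1`, 400 for `loc2`, and 550 for `loc3`. With several time periods, the same product would be repeated for each \(t\), holding the shares fixed at \(t_0\).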